chore(audit): audit hash expressions across Spark 3.4.3, 3.5.8, 4.0.1, 4.1.1#4476
chore(audit): audit hash expressions across Spark 3.4.3, 3.5.8, 4.0.1, 4.1.1#4476andygrove wants to merge 1 commit into
Conversation
…, 4.1.1 Add per-version audit sub-bullets to `crc32`, `hash`, `md5`, `sha`, `sha1`, `sha2`, and `xxhash64` in `docs/source/contributor-guide/spark_expressions_support.md`. `sha` is a registry alias of `Sha1`. Spark 4.0 only adds the `DefaultStringProducingExpression` trait and the `nullIntolerant` field refactor across this category; no runtime behaviour change. Apply support-level consistency fixes surfaced by the audit: - Refactor `HashUtils` to return reasons (`unsupportedReasonFor`, `supportLevelForChildren`, `unsupportedReasons`) instead of calling `withInfo` from inside the helper. The recursive type check no longer side-effects on the expression tree at type-check time. - `CometXxHash64`, `CometMurmur3Hash`, `CometSha1`, `CometSha2`: override `getSupportLevel` and `getUnsupportedReasons` so the unsupported-child-type and (for Sha2) the non-foldable-numBits restrictions reach the dispatcher and the compatibility doc. No correctness divergences were found, so no new tracking issues are filed. The known `TimeType` gap (Spark 4.0+) is covered by the existing apache#4418 EPIC; the `DecimalType`-precision-18 gap is a documented Spark semantic difference (BigDecimal hashing).
|
No deferred follow-up work from this audit. The two known limitations (TimeType for Spark 4.0+, DecimalType precision > 18) are already declared via |
| private def unsupportedReasonFor(dt: DataType): Option[String] = dt match { | ||
| case d: DecimalType if d.precision > 18 => Some(unsupportedDecimalReason) | ||
| case s: StructType => | ||
| s.fields.iterator.flatMap(f => unsupportedReasonFor(f.dataType).iterator).toSeq.headOption |
There was a problem hiding this comment.
If unsupportedReasonFor(f.dataType).iterator returns None for the first element of fields, will s.fields.iterator.flatMap(f => unsupportedReasonFor(f.dataType).iterator).toSeq.headOption return None?
Do we need to make sure all fields return None?
There was a problem hiding this comment.
This will return the first Some item, or None if they are all None
Which issue does this PR close?
Closes #.
Rationale for this change
Continuation of the per-category expression audit. Same pattern as #4475 (conditional), #4474 (misc), #4473 (collection), #4470 (json), #4469 (struct), using the updated
audit-comet-expressionskill in #4468.What changes are included in this PR?
Support-doc audit notes
Add per-version audit sub-bullets to
crc32,hash,md5,sha,sha1,sha2, andxxhash64.shais a registry alias ofSha1. Spark 4.0 only adds theDefaultStringProducingExpressiontrait and thenullIntolerant: Booleanfield refactor on the fourString-producing expressions (Md5,Sha1,Sha2,Crc32); no runtime behaviour change across the category.Support-level consistency fixes (in
hash.scala)HashUtilsto return reasons (unsupportedReasonFor,supportLevelForChildren,unsupportedReasons) instead of callingwithInfofrom inside the helper. The recursive type check no longer side-effects on the expression tree at type-check time, which the audit skill calls out as the canonical antipattern.CometXxHash64,CometMurmur3Hash,CometSha1,CometSha2: overridegetSupportLevelandgetUnsupportedReasonsso the unsupported-child-type and (forSha2) the non-foldable-numBitsrestrictions reach both the dispatcher's EXPLAIN message and the compatibility doc generator.Tracking issues filed for follow-up
None. The
TimeTypegap (Spark 4.0+) is covered by the existing #4418 EPIC; theDecimalType-precision-18 gap is a documented semantic difference (Spark hashes via JavaBigDecimal), already declared by the newHashUtils.unsupportedReasons.Audit process
Audited directly using the
audit-comet-expressionskill (4 Spark versions per #4468). Four serde objects plus the sharedHashUtilshelper.How are these changes tested?
./mvnw test -Dsuites="org.apache.comet.CometHashExpressionSuite" -Dtest=none(37 tests pass)make coresucceeds with the serde refactor.